Using the dataset obtained from FSU’s Florida Climate Center for a station at Tampa International Airport (TPA) in 2022, attempt to recreate the charts shown below, which were generated using data from 2016. You can read the 2022 dataset with the code below:
library(tidyverse)
weather_tpa <- read_csv("https://raw.githubusercontent.com/reisanar/datasets/master/tpa_weather_2022.csv")
# random sample
sample_n(weather_tpa, 4)
## # A tibble: 4 × 7
## year month day precipitation max_temp min_temp ave_temp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022 1 28 0.00001 67 54 60.5
## 2 2022 9 24 0 92 76 84
## 3 2022 5 29 0 92 72 82
## 4 2022 5 17 0 89 75 82
summary(weather_tpa)
## year month day precipitation
## Min. :2022 Min. : 1.000 Min. : 1.00 Min. :0.0000
## 1st Qu.:2022 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:0.0000
## Median :2022 Median : 7.000 Median :16.00 Median :0.0000
## Mean :2022 Mean : 6.526 Mean :15.72 Mean :0.1697
## 3rd Qu.:2022 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:0.0300
## Max. :2022 Max. :12.000 Max. :31.00 Max. :2.8600
## max_temp min_temp ave_temp
## Min. :45.00 Min. :31.00 Min. :38.00
## 1st Qu.:80.00 1st Qu.:63.00 1st Qu.:71.00
## Median :87.00 Median :70.00 Median :78.00
## Mean :84.54 Mean :68.21 Mean :76.37
## 3rd Qu.:92.00 3rd Qu.:77.00 3rd Qu.:84.00
## Max. :98.00 Max. :83.00 Max. :89.50
See https://www.reisanar.com/slides/relationships-models#10
for a reminder on how to use this type of dataset with the
lubridate package for dates and times (example included in
the slides uses data from 2016).
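As a quick reminder of the idea in those slides, the separate year/month/day columns can be combined into a proper Date column with lubridate. A minimal sketch (assuming `weather_tpa` has been read in as above; `make_date()` is one option, though the slides may show a different approach):

```r
library(tidyverse)
library(lubridate)

# Build a Date column from the separate year/month/day columns,
# then derive month labels that are handy for grouping and faceting
weather_tpa <- weather_tpa %>%
  mutate(date = make_date(year, month, day),
         month_name = month(date, label = TRUE, abbr = FALSE))
```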
Using the 2022 data:
Hint: the option binwidth = 3 was used with the
geom_histogram() function.
library(ggplot2)
library(RColorBrewer)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyverse)
weather_tpa <- read_csv("https://raw.githubusercontent.com/reisanar/datasets/master/tpa_weather_2022.csv")
## Rows: 365 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): year, month, day, precipitation, max_temp, min_temp, ave_temp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_n(weather_tpa, 4)
## # A tibble: 4 × 7
## year month day precipitation max_temp min_temp ave_temp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022 1 14 0 72 55 63.5
## 2 2022 9 29 0 77 65 71
## 3 2022 11 7 0 87 70 78.5
## 4 2022 12 25 0 46 31 38.5
tpa_clean <- weather_tpa %>%
  unite("doy", year, month, day, sep = "-") %>%
  mutate(date = ymd(doy),
         max_temp = as.double(max_temp),
         min_temp = as.double(min_temp))
tpa_months <- tpa_clean %>%
  mutate(month_num = month(date),
         month_abb = month(date, label = TRUE),
         month_name = month(date, label = TRUE, abbr = FALSE))
tpa_months %>%
  ggplot(aes(x = max_temp)) +
  geom_histogram(aes(fill = month_name), binwidth = 3, color = "white") +
  scale_fill_viridis_d() +
  labs(x = "Max Temperature", y = "Number of Days") +
  facet_wrap(~ month_name, ncol = 4) +
  xlim(c(60, 90)) +
  ylim(c(0, 20)) +
  theme_bw() +
  theme(legend.position = "none")
## Warning: Removed 116 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 24 rows containing missing values (`geom_bar()`).
tpa_clean %>%
  ggplot(aes(x = max_temp)) +
  geom_density(bw = 0.5, fill = "darkgray") +
  xlim(c(60, 90)) +
  scale_y_continuous(breaks = seq(0, 0.08, by = 0.02)) +
  labs(x = "Maximum temperature") +
  theme_bw()
## Warning: Removed 116 rows containing non-finite values (`stat_density()`).
Hint: check the kernel parameter of the
geom_density() function, and use bw = 0.5.
Hint: default options for geom_density() were used.
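To make those hints concrete, geom_density() exposes both a bandwidth argument (bw) and a kernel argument. A small sketch with the smoothing choices written out explicitly (the kernel names are the standard options from stats::density(); the defaults are bw = "nrd0" and kernel = "gaussian"):

```r
# Same density plot as above, with the smoothing choices made explicit:
# `bw` controls the bandwidth, `kernel` the smoothing kernel
tpa_clean %>%
  ggplot(aes(x = max_temp)) +
  geom_density(bw = 0.5, kernel = "gaussian", fill = "darkgray") +
  labs(x = "Maximum temperature") +
  theme_bw()
```

Other kernels accepted by stats::density(), such as "epanechnikov" or "rectangular", trade smoothness for locality; with bw = 0.5 the Gaussian kernel already gives a fairly wiggly curve.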
tpa_months %>%
  ggplot(aes(x = max_temp)) +
  geom_density(aes(fill = month_name), color = "black", alpha = 0.5) +
  scale_fill_viridis_d() +
  labs(title = "Density plot for each month in 2022",
       x = "Maximum temperatures", y = "") +
  facet_wrap(~ month_name, ncol = 4) +
  xlim(c(60, 90)) +
  ylim(c(0, 0.25)) +
  theme_bw() +
  theme(legend.position = "none")
## Warning: Removed 116 rows containing non-finite values (`stat_density()`).
Hint: use the {ggridges} package and the
geom_density_ridges() function, paying close attention to
the quantile_lines and quantiles parameters.
The plot above uses the plasma option (color scale) for the
viridis palette.
library(ggridges)
tpa_months %>%
  ggplot(aes(x = max_temp, y = month_name, fill = after_stat(x))) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE,
                      quantiles = 2, quantile_lines = TRUE) +
  labs(x = "Maximum temperature (in Fahrenheit degrees)", y = "") +
  scale_fill_viridis_c(name = "", option = "C") +
  theme_minimal()
## Picking joint bandwidth of 1.93
## Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
int_plot <- tpa_months %>%
  ggplot(aes(x = max_temp, y = precipitation, color = month_name)) +
  geom_point() +
  labs(x = "Maximum temperature", y = "Precipitation",
       title = "Precipitation vs. maximum temperature",
       subtitle = "Tampa International Airport - 2022", color = "") +
  theme_bw()
interactive_plot <- ggplotly(int_plot)
interactive_plot
For this plot, I wanted to include an interactive visualization, since those are my favorite.
As someone who grew up in Florida, I can definitely confirm we get more rain during the warmer seasons, and 2022 was no exception. Breaking the graph down by month, we can also see when the cold and warm fronts normally hit in the fall and spring.
htmlwidgets::saveWidget(interactive_plot, "precipitation.html")
Review the set of slides (and additional resources linked in it) for visualizing text data: https://www.reisanar.com/slides/text-viz#1
Choose any dataset with text data, and create at least one visualization with it. For example, you can create a frequency count of most used bigrams, a sentiment analysis of the text data, a network visualization of terms commonly used together, and/or a visualization of a topic modeling approach to the problem of identifying words/documents associated to different topics in the text data you decide to use.
Make sure to include a copy of the dataset in the data/
folder, and reference your sources if they differ from the ones
provided in the assignment.
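For example, a frequency count of the most used bigrams can be built with tidytext’s unnest_tokens(). A minimal sketch, assuming a hypothetical data frame text_df with a text column (not part of this analysis):

```r
library(tidyverse)
library(tidytext)

# Hypothetical input: a data frame `text_df` with a `text` column
bigram_counts <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # drop bigrams that contain common stop words
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

# Plot the ten most frequent bigrams
bigram_counts %>%
  slice_max(n, n = 10) %>%
  unite("bigram", word1, word2, sep = " ") %>%
  ggplot(aes(x = n, y = reorder(bigram, n))) +
  geom_col() +
  labs(x = "Count", y = NULL)
```

The same token counts could also feed a network visualization of co-occurring terms; I chose sentiment analysis instead, below.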
library(tidytext)
Florida Poly news articles are something I frequently work with at my job, so this dataset combines my work with my interest in sentiment analysis. I wanted to visualize the sentiment of words across a negative-to-positive range, so I used the AFINN lexicon to accomplish this.
poly_news <- read.csv("https://raw.githubusercontent.com/reisanar/datasets/master/flpoly_news_SP23.csv")
poly_sentiment <- poly_news %>%
  unnest_tokens(word, news_title) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 500 + 1) %>%
  inner_join(get_sentiments("afinn"))
## Joining, by = "word"
poly_months <- poly_sentiment %>%
  mutate(month_num = month(news_date),
         month_abb = month(news_date, label = TRUE),
         month_name = month(news_date, label = TRUE, abbr = FALSE))
I tokenized the data down to individual words and joined sentiment scores from the AFINN lexicon. I then added month columns to get the full month names from their corresponding numbers, and plugged the result into ggplot to create side-by-side monthly graphs showing the variation in sentiment across each month.
poly_months %>%
  ggplot(aes(x = value, fill = month_name)) +
  geom_bar() +
  guides(fill = "none") +
  labs(x = "Sentiment Value", y = "Count",
       title = "Sentiment analysis on Florida Polytechnic University news across each month") +
  facet_wrap(~ month_name, ncol = 4, scales = "free_x") +
  theme_gray() +
  scale_fill_viridis_d() +
  xlim(c(-5, 5))
## Warning: Removed 1 rows containing missing values (`geom_bar()`).
- The results I infer from this visualization are that January,
February, September, and October frequently spike around a sentiment
value of 2.5. This may be because these are the first two months of the
Fall and Spring semesters, meaning there is frequently positive news at
the beginning of the main semesters. I honestly expected a spike in May
for graduation, but that may be reflected earlier, in April.